THE DESIGN OF SUPER FOONLY

Speaker: David Pool, Stanford Artificial Intelligence Lab.

Abstract: Super Foonly is a Stanford designed processor
with a PDP-10 instruction set but that will perform most
instruction sequences an order of magnitude faster than a KI10.
Most of its performance derives from its 2K cache and its
capabilities are extended by a paging box.

OVERVIEW 

Super Foonly was generated at the Stanford AI Lab as the
result of the familiar problem of running out of compute power.
Our PDP-10 is connected to 30 display terminals, 3330 disk
drives and special purpose hardware like TV displays and so it
usually becomes overburdened in the early afternoons. When we
first noticed the problem a couple of years ago we discussed it
with DEC but decided that the then quoted price performance
improvement of the KI10 was unimpressive. After some of us had
managed to learn some details about large computer design from
IBM personnel, we decided that we could build it ourselves.
After a couple of months of preliminary work, two of us proposed
to design and construct a plug-in replacement for the KA10 which
would use the memories already in house and the same peripherals,
and require no change to user programs, but that would provide a
paging box and effect a program execution speedup of a factor of
10 over the KA10.

An average of 3 1/2 of us have been working on it for 2
years now although the original intention was that it would
require just one year. We are currently in the final design
stages and are busy laying out PC boards and wire lists. In
early '73 we will be ordering parts, and we expect to have it
constructed by June and debugged about 6 months later.

The machine will work with standard PDP-10 memories but
Foonly operates four of them in parallel. Most of our memories
are Ampex supplied, double linked, 74 bits (currently being
operated in 1/2 word mode).

The project started with simulations to which we fed
programs chosen from those running at the lab. The
simulations predicted execution of 3 instructions per
microsecond compared to measurements on our KA10 of slightly
less than 1 every microsecond. The speedup is the main attribute
but we introduced a few other features into the design.



MICROCODE 

The machine is controlled by microcode implemented such
that it is possible for some users to write modifications to it,
perhaps to define new instructions. There are 512 words of
writeable storage for the microcode of which 200 are permanently
allocated for standard instruction interpretation. A user
might also desire to place the inner loops of some long
processes in the microstore which may give up to a factor of 2
additional performance as a result of eliminating instruction
fetching and preparing. The decision to use microcode was
influenced by Gordon Bell.

NEW ADDRESSING MODES 

Other features users will see are two additional modes
for indexing by the program counter and extended addressing (22
bits). In the former any indexing on register 17 will result in
indexing by the program counter instead. In extended
addressing mode the effective address is computed by adding the
right hand 22 bits of an index register to the right hand 18
bits of an instruction or, for indirect addressing, the right 22
bits of the indirect word are used plus indexing if any (the
indirect and indexing fields of the indirect word are shifted
left). The program counter is also extended. All such memory
references are in the virtual space and will be mapped by the
paging box into real memory.
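
The extended address calculation can be sketched in a few lines
of Python. This is only an illustration of the arithmetic
described above, not the Foonly microcode: the function names are
hypothetical, and wrapping the sum to 22 bits is an assumption,
since the talk does not say what happens on overflow.

```python
MASK18 = (1 << 18) - 1  # right-hand 18 bits (address field of an instruction)
MASK22 = (1 << 22) - 1  # right-hand 22 bits (extended address)


def extended_ea(instr_word, index_value=None):
    """Effective address for a direct reference in extended mode.

    instr_word  -- instruction word; only its right 18 bits are used.
    index_value -- contents of the index register, or None if no indexing.
    """
    ea = instr_word & MASK18
    if index_value is not None:
        ea += index_value & MASK22
    return ea & MASK22  # assumption: the sum wraps to 22 bits


def extended_ea_indirect(indirect_word, index_value=None):
    """Effective address taken from an indirect word in extended mode."""
    ea = indirect_word & MASK22
    if index_value is not None:
        ea += index_value & MASK22
    return ea & MASK22  # assumption: the sum wraps to 22 bits
```

For example, an instruction address field of 100 (octal) indexed
by a register holding 1000000 (octal) gives an effective address
of 1000100 (octal), which lies beyond the 18-bit address space.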

CONSOLE COMPUTER 

The console will be replaced by a plug with which you
connect Foonly to another computer known as the console
computer. The latter will have a display and be capable of
examining all the interior flip-flops of Foonly that would
normally be connected to console lights (over 3,000 of them).
It can also push the start, stop, deposit and examine switches.
The main applications will be in debugging and maintenance but
also for Foonly startup as its microcode memories are initially
blank. We intend to make our current PDP-10 the console for
Foonly. We expect to prepare simulators for the console
machine that can single step Foonly through a program while
running its own simulator over the same program to detect the
first point of divergence. The console computer idea was
suggested by John McCarthy.
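
The lock-step comparison described above might look roughly like
the following Python sketch. It assumes each machine can be
presented as a sequence of state snapshots, one per single-stepped
instruction; the function name and the snapshot format are
hypothetical and are not taken from the talk.

```python
from itertools import zip_longest


def first_divergence(hardware_states, simulator_states):
    """Return (step, hardware_state, simulator_state) at the first
    mismatch, or None if both runs agree and end together.

    Each argument is an iterable yielding one snapshot per step.
    If one run ends early, its side of the result is None.
    """
    missing = object()  # marks "this run has already ended"
    pairs = zip_longest(hardware_states, simulator_states, fillvalue=missing)
    for step, (hw, sim) in enumerate(pairs):
        if hw != sim:
            return (
                step,
                None if hw is missing else hw,
                None if sim is missing else sim,
            )
    return None
```

Because the comparison stops at the first mismatch, the console
machine only needs to hold the current pair of snapshots rather
than a full history of either run.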

DESIGN AUTOMATION 

To aid in the design we generated a set of design
automation programs that assist the designer in an interactive
mode at a graphic display console. The programs help to lay out
the circuit boards (hard copy is produced on a plotter) and then
automatically prepare the wire lists.

SPONSORSHIP 

Super Foonly is being sponsored by the Advanced Research
Projects Agency of the Department of Defense not as a research
project but simply as a solution to the problem of getting more
computer power in the most convenient way. We feel the design
has a lot more low cleverness in it than other computer designs
but we don't claim to be pioneering any new concepts in computer
organization. We would very much like to see DEC take over the
construction, service and support of Super Foonlys but users
must influence them to do so.

Q: What is the order of magnitude of Foonly's cost?
A: We expect to spend $200,000 to $250,000 for the
manufacture of hardware. The design and personnel costs are not
included.

Q: You are talking essentially what looks like early KI10
prices plus the usual assortment of DEC peripherals so that for
nice KI10 prices today you are talking about an order of
magnitude better.
A: Hopefully so, and of course we're also talking about not
having to buy all those peripherals. The idea is that you
unplug your KA10 and plug in the Super Foonly; your same old
peripherals will keep working.

Q: Where did the name come from?
A: Super Foonly has no acronymic or historic significance.

Q: What class of gates are you using? What cycle times,
etc.?
A: We are using TI Schottky TTL. If we were going to start
it now we would use the new brand of ECL instead. These gates
have 5 nanosecond worst case delay compared to 12 nanosecond
worst case delay for the KI10 gates, so the internal gates are 2
to 2 1/2 times faster. Of course the secret of the machine's
speed is in the cache memory.

Q: What is the microprogram cycle time?
A: The internal machine cycle design goal is about 100
nanoseconds. 125 nanoseconds was used in the simulations that
produced factor of 10 processing speedup estimates.

CACHE MEMORY 

The secret of Super Foonly's speed is the cache or
buffer memory (similar to that in IBM's new machines) which is a
high speed, random access bipolar memory of modest size which
sits between the main memory and the processor. We learned about
the idea from private communications before the Model 85 was
released but were very skeptical. John Cocke of IBM influenced
us by going through a listing of our LISP interpreter and
pointing out memory references that would hit the cache. We
then wrote a simulator for PDP-10 programs which kept track of
cache usage for a hypothetical cache like that described by
Cocke and found that more than 90% of all memory references hit
the cache.

Q: How big was the main memory, the programs, the cache?
A: We started with a 1K cache. We tried many programs
including some 50K compiled LISP, some big interpretive LISP,
some large hand-coded, and some compiled ALGOL. The final
result was that for a 2K cache and for compiled ALGOL and
interpreted LISP code, almost 99% of the memory cycles hit the
cache.

The simulations showed that compiled programs and LISP
interpreted programs were the best; compiled LISP and big
hand-coded programs (like our ALGOL compiler) were the worst.
The LISP compiler compiling itself and running compiled had a
little over 95% cache hits. The interpretation of this is that
most memory references are for instructions. LISP interpreters
spend much of their time in small loops and compiled programs
spend their time doing strings of PUSH's, MOVE's, and MOVEM's.
It is the hand coded tasks that leap around in clever ways
following complicated control structures that suffer the low
cache hit rate. We also varied certain cache parameters in
simulation but found that for a greater than 1K cache
results were insensitive to most parameters.

We can't really afford (or know how to build) a 512 word
associative memory that operates that fast so the cache really 
uses an approximation which is a hash coding algorithm with 
4-way conflict resolution. Thus, for a real memory address 
there are only four possible places in the cache where it might 
be. The process of searching the cache then reduces to
simultaneously checking these four locations which are a simple 
function of the address. 
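
A cache of this kind can be modelled with the short Python sketch
below. Only the four candidate places per address, the four-word
line and the 2K total size come from the talk. The set count
(128 sets x 4 ways x 4 words = 2K words), the use of the low bits
of the line number as the "simple function of the address", and
round-robin replacement are all assumptions made for illustration;
the class and method names are hypothetical.

```python
WORDS_PER_LINE = 4   # "quadword" line
WAYS = 4             # four candidate places per address
SETS = 128           # assumption: 128 * 4 ways * 4 words = 2K words


class HashedCache:
    def __init__(self):
        # tags[s][w] holds the line number stored in way w of set s, or None.
        self.tags = [[None] * WAYS for _ in range(SETS)]
        self.next_victim = [0] * SETS  # assumption: round-robin replacement
        self.hits = 0
        self.misses = 0

    def access(self, physical_address):
        """Look up one word address; return True on a hit."""
        line = physical_address // WORDS_PER_LINE
        s = line % SETS  # assumption: the hash is the low bits of the line number
        ways = self.tags[s]
        if line in ways:  # hardware would compare all four tags at once
            self.hits += 1
            return True
        # Miss: load the line into the next way of this set.
        self.misses += 1
        ways[self.next_victim[s]] = line
        self.next_victim[s] = (self.next_victim[s] + 1) % WAYS
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Feeding such a model the word addresses referenced by a program
and reading `hit_rate()` at the end is one way to reproduce the
kind of measurement the simulations above report.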

Our paging box is logically on the bus between the
processor and the cache. Thus the cache is part of the memory
system and addresses in it are physical memory rather than
virtual addresses. This removes the problem of invalidating
parts of the cache when a switch in paging mode occurs. Thus
when a user transfers back and forth between his program and the
system, to the cache it just looks like he is calling in
subroutines in his own core image, thereby eliminating
inefficiencies specifically related to system calls.

The main microcode control is in the E box. It has 512
U4-bit words which perform most of the control. The I box has
its own microcode control memory to provide information about
the fetching of operands and decoding of instructions. Thus,
you really can change the instruction set because there is no
connection between the opcodes and what the I box does to them.
It all goes through the I box control memory.

Q: Did your previous work convince you or will there still
be things left to tune?
A: There is not much left to tune. Hardware constraints
determined a lot of the factors: the quadword, 4-way
conflict resolution, and 2K overall cache size are all hardware
constraints. The way we are using the busses is essentially a
hardware constraint. It would have been better to have four

Q: Can you wind this discussion down in some way?
A: Buy Foonly!

Q: What is your processor's relative internal speed
compared to a KA10?
A: All ordinary instructions take one cycle plus whatever
is necessary to fetch data, so an add between 2 accumulators
takes 1 100-nanosecond cycle and an add to memory when the data
comes from the cache takes 2 100-nanosecond cycles. That's much
more than a factor of 10 over the KA10, but it doesn't hit the
cache all the time.
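
The effect of cache misses on this figure can be estimated with a
small Python calculation. The 100-nanosecond cycle and the
two-cycle cost of an add to memory on a cache hit are the figures
quoted in the answer. The cost of a miss is not given in the talk,
so `miss_penalty_ns` is an illustrative parameter, and the function
name is hypothetical.

```python
CYCLE_NS = 100  # design-goal cycle time quoted in the talk


def add_to_memory_ns(hit_rate, miss_penalty_ns):
    """Average time of an add-to-memory instruction, in nanoseconds.

    A hit costs the two cycles quoted in the talk; a miss is assumed
    to cost those two cycles plus miss_penalty_ns.
    """
    hit_time = 2 * CYCLE_NS
    return hit_time + (1.0 - hit_rate) * miss_penalty_ns
```

With a miss penalty of 1000 nanoseconds (an assumed figure, not
one from the talk), a 95% hit rate gives an average of about 250
nanoseconds and a 99% hit rate about 210 nanoseconds.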

Q: Did you try your cache simulations on the monitor as
well?
A: No, we have no way of simulating the execution of
monitor code. You can simulate the execution of user code
just by loading the simulator into the same core image as the
user code but you can't do that with the monitor. We suspect
the monitor may be the worst case, but most monitor code is
executed at UUO level at user request and thus there is no
difference in having that code in the monitor and having it in
the user program. This does not apply to interrupt level
operations including scheduling.

Q: Won't jobs run at different elapsed CPU speeds depending
on what happens around them?
A: Yes, job speed depends upon what's in the cache.

PROCESSOR ARCHITECTURE 

PRICE REDUCTIONS AND PRICING POLICY AT DEC

Speaker: Bill Kieswater, DEC, Maynard, Mass.

DEC'S POSITION RELATIVE TO SUPER FOONLY 

DEC is looking at Foonly and has been working with the 
staff at Stanford. We are one of the vendors but the extent of 
our participation is at the component level, not the system
design. We are also making use of some of the design aids that 
they have developed. You can be assured that the 10 line will 
take advantage of these and similar advancements by our own 
research groups as they prove commercially feasible in terms of 
the tradeoffs. 

PROCESSOR PRICING 

The price of the KI10 has been reduced from $380,000 to
$240,000 (U.S. dollars). The KA10 price remains the same, but
should just the processor be required for an upgrade to a
supported dual processor system, the incremental cost is
$130,000. Similarly, a second KI10 processor, should you already
have a KI based system, is $200,000. These savings result
because the software installation and support component has been
removed.

NEW MEMORIES 

DEC is now producing memories in house. For the PDP-10
the new ones are designated MF10A and MF10G. The MF10A is a 32K
1 microsecond module in a 30 inch cabinet, compatible with any
existing system in the field and priced at $50,000. The MF10G
is 64K, 1 microsecond, fits in a 30 inch cabinet and sells at
$80,000. Expectations are that as in-house memory production
increases to satisfy the minicomputer business, DEC-System 10
memory will be coming down further in price. We are also
looking at in-house production of peripherals such as swapping
media, file storage, and magtapes other than the TU10, but not
line printers.

EQUIPMENT UPGRADE PRICING 

The following upgrade pricing was presented:

From     To       Price
KA10     KI10     130,000
TM10A    TM10B      8,000
RP10A    RP10C      9,000
CR10A    CR10E     12,000
LP10A    LP10C     36,000
TU20     TU40      22,000
TU30     TU40      17,000
DS10     DC75      40,000
MB10     MF10      22,000
MB10     MF10A     42,000
MB10     MF10G     72,000
MA10     ME10      20,000
MA10     MF10A     40,000
MA10     MF10G     70,000
MD10G    MF10G     ??,000
MD10E    MF10A     35,000

These upgrades are offered regardless of the condition of the
device unless it has been physically modified. The field
service charge for the installation and checkout is extra and is
quoted on a local basis as a function of distance to the nearest
field service office and so forth. I believe they have standard
prices to do these upgrades.

The intention is to offer customers an upgrade from low
performance devices to high performance ones as their systems
expand and usage grows. As we develop products of a higher
performance nature they will be offered as upgrades to the lower
performance ones. I expect to see this as a formalized policy
in the near future.

[The following are summarized from the question/answer
session which followed ...ed.]

DISKS 

There is no upgrade from RP02's to RP03's on the same
controller because the pricing probably would not be attractive
enough.

DEC is committed to providing something beyond RP03's,
in the 3330 category, in the future. There are a lot of
contingencies as to whether it will be produced in house or on
the outside. The RP04 has been discussed but no formal
announcement of it has been made. It may not be offered for more
than a year.

MEMORY, MAGTAPES, DATA CHANNELS 

There is no ME10 to MF10 upgrade probably because there
have been no requests. DEC probably could offer one.

About 18 users indicated an interest in a low
performance, low price but 1600 BPI magtape drive for
compatibility with other vendors.

A 22 bit DF10 data channel for use with KI10's is
currently being worked on as is a 22 bit MX10 memory port
expander to go with it.

TENEX 

DEC will supply an unsupported version of a TENEX
monitor that runs on a KI10 for $15,000. The policy does not
permit DEC to write any acceptance criteria for TENEX so the
extent of the warranty is limited. TENEX was originally
developed by Bolt, Beranek & Newman (BBN) to run on a modified
KA10 (1500 wiring changes) with a BBN paging box. DEC has
simulated the BBN pager in software on the KI10 and thus
produced a TENEX monitor for it which seems to run at least as
well as the KA10 version. No comparative performance
measurements have yet been made. TENEX was designed around a
limited variety of DEC furnished peripherals so it may not
support all configurations in the field. There are no plans
inside DEC at this time to support this monitor. The SPR's will
be forwarded to BBN and they will handle them on a best effort
basis.

The FORTRAN optimizing package will lease for $125 a
month. It will be sent on the distribution tape.

APL 

The current status of APL does not permit DEC to offer
it to service bureaus as a result of a contractual agreement the
owners of the language have with a service bureau. That
agreement expires November, 1973. If there is enough interest
DEC will pursue it. There are two versions of APL: the basic
version leases for $300/month, the advanced version for
$600/month. The latter includes files and other features not
available in IBM versions. Each version will work on either
2741's or teletypes; the APL character set is not a requirement
for its use.

UNBUNDLING 

DEC is unable to predict what items will be unbundled in
the future. They view unbundling as a way of recouping funding
now required to maintain the product in a first class state.
There is a large group who do quality assurance, software
evaluation, documentation and field support that incur indirect
charges to the programming effort over and above just the
development programmers. As the price of hardware goes down it
reduces the margin to support such activities.

DEC supplies and separately prices APL and data base
management because there were strong requirements in specific
markets for them. The selection of those items to be unbundled
will probably be market driven and oriented toward the more
specific application areas.